Abstract. A task overload condition often leads to high stress for an operator, causing performance degradation and possibly 
disastrous consequences. Just as dangerous, with automated flight systems, an operator may experience a task underload 
condition (during the en-route flight phase, for example), becoming easily bored and finding it difficult to maintain sustained 
attention. When an unexpected event occurs, either internal or external to the automated system, the disengaged operator may 
neglect, misunderstand, or respond slowly/inappropriately to the situation. In this paper, we discuss an approach for Operator 
Functional State (OFS) monitoring in a typical aviation environment. A systematic ground truth finding procedure has been designed 
based on subjective evaluations, performance measures, and strong physiological indicators. The derived OFS ground truth is 
continuous in time compared to a very sparse estimation of OFS based on an expert review or subjective evaluations. It can capture 
the variations of OFS during a mission to better guide through the training process of the OFS assessment model. Furthermore, an 
OFS assessment model framework based on advanced machine learning techniques was designed and the systematic approach 
was then verified and validated with experimental data collected in a high fidelity Boeing 737 simulator. Preliminary results show 
highly accurate engagement/disengagement detection making it suitable for real-time applications to assess pilot engagement. 


NOMENCLATURE 

ANOVA: ANalysis Of VAriance 

ATC: Air Traffic Controller 

ATP: Airline Transport Pilot 

BP: Back Propagation 

BPS: Boredom Proneness Scale 

CATS: Cognitive Avionics ToolSet 

CDU: Control & Display Unit 

ECG: Electrocardiogram 

EEG: Electroencephalogram 

GMM: Gaussian Mixture Model 

MEL: Multi-Engine Land 

NASA: National Aeronautics and Space Administration 

NATO: North Atlantic Treaty Organisation 

NN: Neural Network 


OFS: Operator Functional State 

PSD: Power Spectrum Density 

SART: Situational Awareness Rating Technique 

SEL: Single-Engine Land 

SVM: Support Vector Machine 

TLX: Task Load Index 

1.0 INTRODUCTION 

The primary focus of this research is to 
provide a real-time Operator Functional 
State (OFS) assessment mechanism. 
According to North Atlantic Treaty 
Organisation (NATO) [1], OFS is defined as 
the multidimensional pattern of human 
psychophysiological condition that mediates 
performance in relation to physiological and 
psychological costs. In aviation systems, 
two types of hazardous operator states are 
likely to lead to human errors [2]: a stress 
state due to high cognitive workload (we do 
not consider physical workload in this 
research) or a complacent/bored state due 
to extremely low cognitive workload over a 
prolonged period of time [3]. Proper 
assessment of cognitive workload and 
appropriate workload modulation offer the 
potential to improve mission effectiveness 
and aviation safety in both overload and 
underload conditions [3]-[6]. In commercial 
flights (especially long-haul flights), pilots 
often experience periods of high workload 
during pre-flight preparations, takeoff and 
landing, as well as longer periods of very 
low workload as the pilot cruises enroute 
toward the destination with the aircraft on 
autopilot. Pilots can easily become 
disengaged during the enroute phase as 
they may be less attentive under low 
workload. When unexpected events occur, 
especially for fatigued pilots, 
disengagement could lead to operational 
errors. Such events can include unexpected 
changes in weather (turbulence, for 
example), equipment failure/malfunction 
(such as hydraulic pump failure) or potential 
collisions with other aircraft. 

During the past few years, we have been 
developing a systematic approach for OFS 
assessment [7][8]. In our previous research, 
we have conducted preliminary studies on 
identifying the ground truth for OFS, which 
is required to train an OFS assessment 
model. Data is needed to train an OFS 
assessment model based on input signals 
(subjective/objective measures) and 
associated operator states, such as 
engaged/disengaged labels. Many existing 
studies have utilized psychophysiological 
measurements to index the level of 
cognitive demand associated with a task 
[2][3], fatigue [9][10], engagement [11][12], 
and other functional state dimensions [1]. 
However, most of them label OFS based on 
subjective evaluations, and the labels are often very 
sparse temporally (only a few labels during 
the whole experiment). In our research, we 
have identified several types of information 
sources that can indirectly infer 
engagement, improving the ground truth 
finding procedure described in [7] to take 
into account the degradation and recovery 
of an OFS due to factors such as changes 
in workload. At the same time, we 
developed an enhanced committee 
machine-based model for engagement 
assessment. The developed techniques 
have been verified and validated with 
experimental data collected from a Boeing 
737 simulator. Initial results show accurate 
real-time assessment of pilot engagement 
state. 

The remainder of the paper is organized as 
follows. Section 2 describes the flight 
simulation configuration and experiment 
design. Section 3 presents a mechanism to 
determine the ground truth for engagement 
modeling. Section 4 describes an enhanced 
committee machine based real-time 
engagement assessment model. Section 5 
shows preliminary performance evaluation 
results. Section 6 concludes the paper. 

2.0 ENGAGEMENT ASSESSMENT 
EXPERIMENT DESIGN 

In order to study engagement, we 
conducted experiments in a fully equipped 
Boeing 737 simulator [13] involving 
commercial pilots. In an earlier study, 
Ellis [13] described the simulator as a fully 
functional flight deck with full glass cockpit 
displays, five outside visual projectors, 
functioning mode control panel with 
autopilot and autothrottle, and standard 
Boeing 737 controls (Figure 1). 



Figure 1. Boeing 737-800 simulator 

2.1 Participants 

Several subjects have participated in the 
pilot engagement study. Pilots had varying 
levels of experience with different types of 
aircraft. All had an instrument rating and 
commercial/private/ATP (Airline Transport 
Pilot) licenses, with experience in Single-Engine 
Land (SEL), Multi-Engine Land 
(MEL), Jet, or Turboprop aircraft. As an example, 
Table 1 lists experience levels of four 
subjects and the type of licenses they hold. 


Table 1. Example of participant flight 
experience and licenses 


Subject   SEL   MEL   Jet   Turboprop   Licenses
1         45    0     0     0           Commercial
2         15    0     0     0           Commercial
3         7     3.5   2.5   0           Commercial, ATP
4         22    14    14    1           ATP


2.2 Experimental design 

The experiments involved a flight from 
Seattle Tacoma International Airport to 
Chicago O’Hare International Airport. The 
details of the flight have been extracted 
from an actual American Airlines flight which 
took place on May 10th, 2010. Details were 
provided on flightaware.com. The flight path 
is represented by the blue line in Figure 2. 



Figure 2. Simulation flight path 


In order to study the effects of sleep-loss 
related fatigue on engagement, all pilots 
were scheduled to arrive at 5:30pm and 
were asked to avoid drinking caffeinated 
beverages such as coffee on the day of the 
experiment. An orientation video was shown 
to the subjects before the simulated flight. 
The video contained a description of the 
experiment as well as a Control & Display 
Unit (CDU) programming training section. 
The video included a description of the 
sensors and video recording devices used 


during the experiment, as well as the 
responsibilities that the pilots would have 
during the experiment. The details shared 
with the subjects did not include information 
on the probes that were used to measure 
engagement levels so that the pilots did not 
anticipate these probes throughout the 
experiment. During the flight simulation, one 
of the staff controlled the simulation 
computer to play pre-recorded audio files 
mimicking ATC transmissions. An 
experimenter was in charge of tagging the 
data to make sure that proper labels were 
added to the data sheets to identify the 
phases of the experiment as well as the 
times when the pilot responded to ATC. At 
the end of the experiment, the subjects filled 
out a subjective survey to assess their 
workload, fatigue and situational awareness 
during different phases of flight. 

The flight simulation included three events 
inserted into the flight scenario. The events 
were scheduled to occur at predetermined 
times to observe and measure how pilots 
responded. The first event was an ATC call 
asking the pilot to report when the aircraft 
was at 29000 feet. This call came while the 
aircraft was crossing 19000 feet. The aim of 
this event was to assess whether the pilot 
would remain engaged at the early stage of 
initial ascent. The second event was 
another ATC call that asked the pilot to 
report their position at 20 miles east of HLN 
(one of the waypoints). This call came at the 
early stages of level flight. The goal of this 
event was to determine whether the pilot 
would remember to call back ATC at the 
designated point. The third probe was a 
failure event. Half of the subjects received a 
failure signal 1 hour into the flight 
simulation. The other half received the 
signal 3 hours into the flight simulation. This 
approach was preferred because, if every run 
had included both failures, the pilots might have 
remained in an engaged state throughout 
the flight after the first failure, expecting 
that further failures might be 
inserted into the scenario to test their 
performance. The data collected during 
these events can be compared to establish 
the difference between the two engagement states 
in terms of physiological measures and 
subjective ratings. The failure event was selected 
such that it would not prompt a drastic 
decision such as an emergency landing but 
would allow the pilot to solve the problem 
with onboard capabilities. 

2.3 Subjective evaluations 

In the experiments, we used several 
subjective rating scales, collected after 
each experiment: the Situational Awareness Rating 
Technique (SART [14]), the Bedford workload 
scale [15], the NASA Task Load Index (NASA 
TLX [15]), the Samn-Perilli fatigue scale [16], 
and the Boredom Proneness Scale (BPS) [17]. 

In order to minimize the effects of intrusive 
questioning, a post experiment survey was 
conducted. Each subject was asked about 
his/her boredom proneness, perceived level 
of workload, situation awareness, and 
fatigue during different phases of flight. 

2.4 Objective observations 

In addition to flight technical data (altitude, 
speed, etc.), objective data were collected 
with three types of 
sensors: eye tracking cameras, an 
EEG net, and an EKG sensor. The data was 
streamed into the Cognitive Avionics 
Toolset (CATS), which is an analysis tool for 
operator functional state assessment. A 
snapshot of CATS is shown in Figure 3. 


In addition, performance data, such as 
response time to ATC calls or pump failure, 
were also collected. 

3.0 ENGAGEMENT GROUND TRUTH 
FINDING 

Before an engagement assessment model 
can be deployed, it needs to be trained 
based on the engagement ground truth and 
corresponding input information 
(physiological signals, performance, and 
others). However, no sensor directly 
provides engagement ground truth. 
In this paper, we created an engagement 
ground truth assessment model 
incorporating subjective evaluation, 
behavioral measures (such as 
communications with ATC and real time 
performance data), and sensor measures 
(such as EEG, eye tracking and EKG data). 

The subjective evaluation data was 
collected after each pilot completed the 
flight simulation. We divided the whole flight 
simulation into 11 phases from takeoff to 
taxi and to gate (plus a special phase when 
the pilot was handling the pump failure), as 
shown in Table 2. For each phase, each 
pilot gave a score for each dimension in 
SART, Bedford workload scale [15], NASA 
TLX, and Samn-Perilli fatigue scale. Each 
pilot also rated his boredom proneness 
based on the survey. 

Figure 3. EEG signal view in CATS 

Table 2. Phases during the flight 

Takeoff to 19K 
19K to 29K 
29K to 37K 
Seattle Center 
Failure/Seattle Center 
Salt Lake Center 
Minneapolis Center 
Chicago Center 
Chicago Center to Call to Descend 
Call to Descend to Leveling at 9000 
Final Descent to Land 
Touchdown and Taxi to Gate 

To derive an engagement profile as ground 
truth for OFS model training, we need to 
consider different sources of information. 
Three major steps are followed: baseline 
construction, degradation/recovery, and 
refinement based on strong indicators. 

1. Baseline construction. Engagement 
baseline is constructed based on 
possible incentives/motivations. A pilot 
with strong motivation or in a mission 
with high incentive usually has a 
relatively higher engagement level. In 
the initial study, we set the engagement 
baseline to a constant highest level for 
simplicity. 

2. Degradation/recovery. Engagement 
status usually changes due to 
workload/task change and/or 
occurrence of unexpected events. 
Expected events include regular ATC 
calls or corresponding replies. Although 
those expected events do not have a 
precise time schedule, their happening 
would not surprise the pilots. When 
expected events happen, the operator’s 
engagement level only increases by the 
minimum amount necessary to 
accomplish the known task, and 
decreases to the previous amount 
quickly thereafter. 

On the other hand, unexpected events 
are those that the pilots are not 
prepared for. In our experiment, the 
pilots were not aware of the pump 
failure event in advance. We 
hypothesize that a pilot has a more 
rapid engagement recovery when an 
unexpected event happens, and it can 
keep the pilot alert for a longer period of 
time, which indicates a slower 
degradation speed in engagement level 
thereafter. 

3. Refinement based on strong 
indicators. Strong disengagement/engagement 
indicators based on 
measurements, such as eye 
closure/head drooping (due to fatigue) 
indicating a disengaged state and a short 
R-R interval (fast heartbeat) indicating 
an engaged state, are used for 
engagement refinement. In our initial 
research, described in this paper, we 
used the pilot’s R-R (heartbeat) interval as 
an indicator of engagement level. High 
R-R interval values imply a relaxed 
state in which the pilot’s engagement 
level will degrade, and low R-R interval 
values indicate an engagement recovery 
stage. 

The degradation/recovery speed of 
engagement is unique to each individual. 
This individual difference may be estimated 
through objective measures, such as an 
individual’s fitness level, or with subjective 
evaluations, such as the boredom 
proneness scale. An easily bored operator 
usually gets distracted faster, and the lower 
the workload is, the faster the engagement 
level drops. In summary, the schema is 
shown in Figure 4. 
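
The baseline and degradation/recovery steps above can be illustrated with a minimal sketch. It is not the model used in this study: the function name, the per-minute time step, and every numeric default (decay rate, boost size, slow-decay window) are assumptions chosen for illustration, and the R-R based refinement step is omitted.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time_min: int    # minute of the flight at which the event occurs
    expected: bool   # True for routine events (e.g. ATC calls), False for surprises


def engagement_profile(duration_min, events, baseline=100.0, floor=30.0,
                       decay_per_min=1.0, expected_boost=10.0,
                       slow_factor=0.25, slow_window_min=20):
    """Return one engagement value per minute of flight.

    All numeric defaults are illustrative assumptions, not fitted values.
    """
    events_at = {}
    for ev in events:
        events_at.setdefault(ev.time_min, []).append(ev)

    level = baseline            # step 1: constant highest baseline
    last_unexpected = None
    profile = []
    for t in range(duration_min):
        # Step 2a: degradation, slower for a while after an unexpected event.
        rate = decay_per_min
        if last_unexpected is not None and t - last_unexpected <= slow_window_min:
            rate *= slow_factor
        level = max(floor, level - rate)

        reported = level
        for ev in events_at.get(t, []):
            if ev.expected:
                # Step 2b: brief, small rise; the underlying level is unchanged,
                # so engagement falls back to its previous value right after.
                reported = min(baseline, reported + expected_boost)
            else:
                # Step 2c: rapid, full recovery after an unexpected event.
                level = baseline
                reported = baseline
                last_unexpected = t
        profile.append(reported)
    return profile
```

For example, `engagement_profile(60, [Event(30, True), Event(40, False)])` yields a slowly decaying curve with a one-minute bump at the expected event and a full reset, followed by slower decay, at the unexpected one. An individual's decay rate could be scaled by a boredom proneness score, which this sketch leaves as a plain parameter.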

4.0 ENGAGEMENT ASSESSMENT 
MODEL 

With the engagement level labeled, an 
engagement assessment model has been 
trained. The inputs to the engagement 
assessment model include EEG, eye 
tracking and EKG data. These signals were 
first preprocessed (filtering, outlier removal, 
artifact removal), and the most relevant 
features were extracted and selected. 



Figure 4. Engagement ground truth finding 

Given the input feature sets, a committee 
machine method has been implemented to 
relate these features to an OFS. The basic 
idea of a committee machine [18] is to 
aggregate the outputs from several OFS 
assessment models/committee members 
specified by users. Different algorithms can 
function as committee members, for 
example, Neural Networks (NNs), Gaussian 
Mixture Models (GMMs), and Support 
Vector Machines (SVMs). In this research, 
the committee machine was implemented 
using a multilayer perceptron trained by the 
standard Back Propagation (BP) algorithm 
as the base classification model. 
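
The aggregation idea can be sketched as follows. This is a simplified illustration, assuming each committee member returns a probability of the engaged state and that the committee averages them; the paper does not specify the combination rule, and the names `committee_predict` and `Member` are hypothetical. In practice each member would be a trained classifier such as a multilayer perceptron.

```python
from typing import Callable, List, Sequence, Tuple

# A committee member maps a feature vector to P(engaged) in [0, 1].
Member = Callable[[Sequence[float]], float]


def committee_predict(members: List[Member], features: Sequence[float],
                      threshold: float = 0.5) -> Tuple[int, float]:
    """Average the member outputs and map the result to a state label.

    Returns (state, mean_probability), with state 2 = engaged and
    state 1 = disengaged. Simple averaging is an assumption made here.
    """
    if not members:
        raise ValueError("the committee needs at least one member")
    probabilities = [member(features) for member in members]
    mean_probability = sum(probabilities) / len(probabilities)
    state = 2 if mean_probability >= threshold else 1
    return state, mean_probability
```

Averaging tends to reduce the variance of individual members; weighted voting or a trained combiner are alternative designs.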

5.0 EXPERIMENTAL RESULTS 

The developed techniques were evaluated 
with the experimental data collected through 
a Boeing 737 flight simulator. During data 
collection, a sensor setup problem caused 
most of the ECG data to become 
contaminated and this data was not used for 
the initial evaluation. Next, we describe the 
results of engagement ground truth finding, 
data processing, and OFS modeling 
evaluation. In [19] (in press), we 
presented detailed results on sensor data 
processing, feature analysis, and modeling. 

5.1 Ground Truth Finding 

The engagement level of an operator varies 
during the entire period of flight and cannot 
be determined a priori. In this paper, we 
create a benchmark engagement based on 
commonly accepted assumptions. 

Taking off and landing are tasks in which 
pilots usually are highly engaged since 
these tasks are relatively more challenging. 
Also, when pump failures happen and pilots 
realize the failures, they will be fully 
engaged due to the emergency condition. 
On the other hand, during level flight, pilots 
tend to be at a low engagement level due to 
flight automation. 

Therefore, we set a high engagement level 
(100) at takeoff, landing and pump failure 
handling phases, and set a low engagement 
level (30) during level flight periods. A 
middle engagement level (80) is assigned to 
climbing and descending tasks. Also, high 
and medium engagement task periods are 
followed by a 10-minute extension since it 
takes time for the pilot to relax. Figure 5 
shows the benchmark engagement during 
the flight. Although the benchmark 
engagement is not an accurate prediction, it 
provides a reference to validate the 
proposed engagement assessment model. 
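
The benchmark rule can be written down as a short sketch. The levels (100, 80, 30) and the 10-minute extension come from the text above; the function name, the per-minute resolution, the phase boundaries in the usage example, and the choice to keep the higher level where an extension overlaps the next phase are assumptions.

```python
# Engagement levels stated in the text.
LEVELS = {"high": 100, "medium": 80, "low": 30}


def benchmark_engagement(phases, extension_min=10):
    """Build a per-minute benchmark engagement curve.

    phases: list of (start_min, end_min, kind) with kind in LEVELS,
            where end_min is exclusive.
    High and medium phases are extended by `extension_min` minutes;
    where an extension overlaps another phase, the higher level is kept
    (an assumption of this sketch).
    """
    total = max(end for _, end, _ in phases)
    profile = [LEVELS["low"]] * total
    for start, end, kind in phases:
        stop = end if kind == "low" else min(total, end + extension_min)
        for t in range(start, stop):
            profile[t] = max(profile[t], LEVELS[kind])
    return profile
```

As a usage example with hypothetical phase times, `benchmark_engagement([(0, 10, "high"), (10, 30, "medium"), (30, 90, "low")])` keeps the level at 100 until minute 20, at 80 until minute 40, and at 30 afterwards.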




Figure 5. Benchmark Engagement 

Figure 6 shows the workload and R-R 
interval for subject 1 during the flight. High 
workload values are observed during takeoff 
and landing phases, as well as during the 
time spent handling the pump failure. Based 
on measurements of workload and 
observations during the experiment, we 
believe the fluctuations of the R-R interval 
correlate well with, and can contribute to the 
estimation of, benchmark engagement. 



Figure 6. Mean R-R interval (in green) and 
workload (in blue) during the flight 

Based on the workload and heart rate 
information collected during the experiment, 
along with expected and unexpected 
events, our proposed engagement ground 
truth finding model generates a continuous 
real-time engagement evaluation, as shown 
in Figure 7. 



Figure 7. Engagement ground truth 

Again, we can see that high engagement 
level is experienced during takeoff, landing 
and failure handling phases. Low 
engagement is experienced during level 
flight periods. Meanwhile, the ground truth 
profile reflects the engagement state 
variation due to expected events, such as 
ATC calls. 

We calculated Pearson's correlation 
between the generated engagement curve 
and the benchmark. The correlation coefficient 
is 0.7894 with a p-value less than 1e-5, 
which indicates a statistically significant 
positive correlation between the two curves. 
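
For reference, the Pearson coefficient between two equally sampled curves can be computed as in the sketch below. The function name is hypothetical and the p-value is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Both curves must be sampled on the same time grid before the comparison, for example one value per minute.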

5.2 Experiment Data Preparation for 
OFS Modeling 

In this study, two kinds of engagement 
states and their time durations were first 
identified by watching videos. A pilot’s state 
during takeoff or while handling a pump 
failure was considered ‘engaged’ 
(state 2). The pilot’s state during level flight 
without any manipulation, or while napping, was 
considered ‘disengaged’ (state 1). 
Calculated features can then be labeled 
with these states by aligning them with the 
identified time information, as shown in 
Table 3. 

Table 3. Pilots’ engagement states 

Subject   Duration              State            Comments
1         19:08:00-19:18:00     2 (Engaged)      Taking off
          21:08:00-21:17:00     1 (Disengaged)   Level flight
2         19:52:00-20:03:00     2 (Engaged)      Taking off
          21:19:00-21:29:00     1 (Disengaged)   Level flight
3         19:13:00-19:23:00     2 (Engaged)      Taking off
          21:54:00-22:04:00     1 (Disengaged)   Level flight
4         20:58:00-21:08:00     2 (Engaged)      Taking off
          23:25:00-23:35:00     1 (Disengaged)   Level flight


5.3 EEG Data Processing 

The data processing procedure for the EEG 
sensors is shown in Figure 8. We start with 
removal of environmental and DC artifacts, 
then removal of EEG datasets with 
unreasonable measurements based on 
standard deviation (such as 0 indicating no 
signal collected), selection of EEG channels 
of interest, identification of 
spikes/excursions/amplifier saturation, 
removal of artifacts, calculation of Power 
Spectrum Density (PSD), and analysis with 
two different techniques, Fisher score and 
ANOVA, for engagement assessment. 



Figure 8. EEG data processing 

ABM [20] suggested the bipolar sites Fz-POz and 
Cz-POz for engagement assessment, and 
C3-C4, Cz-POz, F3-Cz, F3-C4, Fz-C3, and 
Fz-POz for workload assessment. Leonard J. 
Trejo [21] emphasized that mental fatigue 
was associated with Fz, P7 and P8. To 
analyze pilots’ mental states during the 
simulated flight, we started with all channels 
mentioned in these studies. The EEG 
sensors being used (actiCAP) did not 
provide the channel POz, so we selected 
Oz as a substitute, which is the nearest 
sensor to POz. For comparability, 
sensors P7 and P8 were each paired with Oz. 
Finally, the selected EEG 
sensor pairs were Fz-Oz, Cz-Oz, C3-C4, F3-Cz, 
F3-C4, Fz-C3, P7-Oz and P8-Oz. 

In this study, EEG absolute PSD variables 
for each 1-s epoch were computed. For 
each bipolar pair, the power spectrum within 
each band was summed up as a feature. All 
the features were analyzed based on a 
Fisher score, which is the normalized 
distance between data points belonging to 
different states. The larger the Fisher score 
is, the farther apart the different 
states are, indicating a better feature. By 
sorting the aforementioned features, we can 
find the most valuable features. 
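
The feature computation and ranking described above can be sketched as follows. The band summation follows the text; the Fisher score uses a common two-class form, squared difference of class means divided by the sum of class variances, which is an assumption because the exact normalization used in this study is not stated. Function names and band edges are hypothetical.

```python
import statistics


def band_power(freqs_hz, psd, low_hz, high_hz):
    """Sum the PSD bins whose frequency lies in [low_hz, high_hz)."""
    return sum(p for f, p in zip(freqs_hz, psd) if low_hz <= f < high_hz)


def fisher_score(class_a, class_b):
    """Two-class Fisher score of one feature (assumed form).

    class_a, class_b: feature values observed in each state, for example
    band powers from 1-s epochs labeled disengaged and engaged.
    """
    mean_a = statistics.fmean(class_a)
    mean_b = statistics.fmean(class_b)
    spread = statistics.pvariance(class_a) + statistics.pvariance(class_b)
    if spread == 0:
        raise ValueError("both classes are constant; the score is undefined")
    return (mean_a - mean_b) ** 2 / spread
```

Features can then be ranked by sorting their scores in descending order, so that the best-separated channel and band combinations come first.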

Furthermore, we analyzed the features 
using ANOVA. For example, P8-Oz in 
Gamma band was examined and its PSD in 
an engaged state (X label: 2) is significantly 
higher than that in a disengaged state (X 
label: 1), as shown in Figure 9. The analysis 
results confirm the usefulness of the 
features being extracted. 
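
As an illustration of the test statistic behind this analysis, a one-way ANOVA F value for a single feature can be computed as in the sketch below. The function name is hypothetical, and converting F to a p-value requires the F distribution, which is not implemented here.

```python
def anova_f(groups):
    """One-way ANOVA F statistic.

    groups: list of lists, one list of feature values per state.
    F = (between-group sum of squares / (k - 1))
        / (within-group sum of squares / (N - k))
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

With two groups, as in the engaged versus disengaged comparison, F equals the square of the two-sample t statistic with pooled variance.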



Figure 9. ANOVA for subject 2 (P8-Oz, 25- 
40Hz) 

5.4 Eye Tracking Data Processing 

Two types of eye movements have been 
studied with respect to attention allocation: 
fixations and saccades. A fixation is defined as 
a gaze point that remains within a 
threshold of two degrees for a minimum 
duration of 200 ms [13]. 
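
One way to apply this definition is a dispersion-threshold detector, sketched below. The algorithm used in this study is not stated, so this greedy dispersion-based variant, the function name, and the sample format (time in ms, horizontal and vertical gaze angle in degrees) are all assumptions; only the two thresholds come from the definition above.

```python
def detect_fixations(samples, max_dispersion_deg=2.0, min_duration_ms=200.0):
    """Find fixations in time-ordered gaze samples.

    samples: list of (t_ms, x_deg, y_deg) tuples sorted by time.
    Returns a list of (start_ms, end_ms) fixation intervals.
    """
    def dispersion(window):
        xs = [s[1] for s in window]
        ys = [s[2] for s in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        # Grow the window while the gaze stays within the dispersion threshold.
        while j + 1 < n and dispersion(samples[i:j + 2]) <= max_dispersion_deg:
            j += 1
        if samples[j][0] - samples[i][0] >= min_duration_ms:
            fixations.append((samples[i][0], samples[j][0]))
            i = j + 1   # continue after the detected fixation
        else:
            i += 1      # too short; slide the window start forward
    return fixations
```

Fixation durations are then the differences between end and start times, and a saccade can be counted between each pair of consecutive fixations.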

Saccadic movement is derived by counting 
a saccade as the movement from one 
fixation to the next. Saccadic movements 
are measured by saccadic distance (deg). 
Their Euclidean distance can be derived by 
determining the plane on which the fixation 
is occurring and identifying the distance 
between that specified location and the eye 
gaze origin. 

For an engaged pilot, the fixation duration is 
usually shorter than that of a disengaged 
pilot, who may be in a state of daydreaming 
or high fatigue. A disengaged pilot during 
the enroute phase may have longer fixation 
durations and/or increased saccade length 
due to decreased workload [13]. 

Figure 10 shows the fixation duration during the flight 
for subject 3. The failure, which happened 
around 20:00, is 
followed by a valley in the fixation value, which 
implies that the engagement level increases 
when the pilot is faced with an emergency 
event. This supports 
the feasibility of using fixation as an 
indicator for engagement evaluation. Similar 
observations can be made for saccades in 
the flight simulation. 





Figure 10. Examples of eye fixation duration 

5.5 OFS Assessment Modeling 
Performance 

We first evaluated the performance of the 
enhanced committee machine-based OFS 
assessment model using a 5-fold cross 
validation technique. More specifically, for 
each individual, we divided the whole 
dataset into five folds (equally) and trained 
an individual's model using four of them. 
The performance was evaluated by testing 
the model with the remaining fold. The 
performance results range from 97.2% to 
99.8% for all four subjects. 
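
The fold construction can be sketched as below. Whether the samples were shuffled before splitting is not stated, so the contiguous, unshuffled folds used here are an assumption, as is the function name.

```python
def k_fold_indices(n_samples, k=5):
    """Split sample indices 0..n_samples-1 into k nearly equal folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    Folds are contiguous and unshuffled (an assumption of this sketch).
    """
    if k < 2 or n_samples < k:
        raise ValueError("need k >= 2 and at least k samples")
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits
```

Each subject's model is trained on the `train` indices and scored on the held-out `test` indices, and the five scores are then averaged. With time-series features, adjacent epochs are correlated, so shuffled folds can give optimistic estimates.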

We also built a model for each individual 
based on very limited training data, which 
contains the first X% of data samples for a 
subject in each engagement state (the value 
of X can be 5, 10, 15 and 20). The dataset 
was normalized by the mean and standard 
deviation of the extracted training data. The 
trained model was then tested with the 
remaining data from that subject. The 
evaluation results for the four individual 
models are shown in Figure 11. The 
detection accuracy for the OFS assessment 
of the four subjects with only 10% of data 
can reach 84.2%, 95.3%, 89.7%, and 
99.4%, respectively. 
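
The limited-data protocol can be illustrated with the following sketch, which takes the first X% of samples in each state for training and normalizes with statistics from the training portion only. It handles a single scalar feature for brevity; the function names and the use of the population standard deviation are assumptions.

```python
import statistics


def split_first_percent(samples_by_state, percent):
    """Take the first `percent` of each state's time-ordered samples for training.

    samples_by_state: dict mapping a state label to a list of feature values.
    Returns (train, test) dicts with the same keys.
    """
    train, test = {}, {}
    for state, samples in samples_by_state.items():
        cut = max(1, len(samples) * percent // 100)
        train[state] = samples[:cut]
        test[state] = samples[cut:]
    return train, test


def fit_normalizer(train):
    """Mean and standard deviation of the pooled training samples."""
    values = [v for samples in train.values() for v in samples]
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("training data is constant; cannot normalize")
    return statistics.fmean(values), std


def normalize(values, mean, std):
    """Apply z-score normalization with the training statistics."""
    return [(v - mean) / std for v in values]
```

Fitting the normalizer on the training portion alone avoids leaking information from the test data into the model.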




Figure 11. Performance evaluation results for 
four subjects. 

6.0 DISCUSSIONS 

In this research, we have successfully 
developed a systematic approach for 
engagement assessment. The approach is 
based on an understanding of the 
relationship between performance, 
workload, and engagement. Future tasks 
include enhancing the model with additional 


sensory information (flight technical data 
and ECG, for example) and continuing to 
verify and validate the real-time 
assessment technique with additional 
participants’ data. 